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Abstract 



The Lasso is a very well known penalized regression model, which adds an L\ penalty 
with parameter Ai on the coefficients to the squared error loss function. The Fused Lasso 
extends this model by also putting an L\ penalty with parameter A2 on the difference of 
neighboring coefficients, assuming there is a natural ordering. In this paper, we develop 
a fast path algorithm for solving the Fused Lasso Signal Approximator that computes 
the solutions for all values of Ai and A2. In the supplement, we also give an algorithm for 
the general Fused Lasso for the case with predictor matrix X 6 R nxp with rank(X) = p. 

1 Introduction 

In recent years, many regression procedures have been proposed that use penalties on the 
regression coefficients in order to achieve sparseness or shrink them towards zero. One of 
the most widely known procedures of this type is the Lasso (see Tibshirani [1996]), which 
minimizes the loss function 



Here, y 6 R™ is the response vector, X e R nxp is the matrix of predictors and (3 e W 
the coefficient vector. Several years after the original Lasso paper was published, the LARS 
algorithm was developed (see Efron et al. [2004]), which after a small adjustment gives 
the whole solution path of the Lasso for the penalty parameter Ai with the computational 
complexity of an ordinary least squares problem. Subsequently, path algorithms for several 
other regression methods were developed as well, for example for generalized linear models 
(see Park and Hastie [2007]) or the SVM (see Hastie et al. [2004]) among others. A more 
general treatment of conditions under which the solution paths are piecewise linear can be 
found in Rosset and Zhu [2007]. 

An example of an extension of the Lasso is the Fused Lasso introduced in Tibshirani 
et al. [2005]. For the Fused Lasso, it is assumed that there is some natural ordering of the 
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coefficients (e.g. each coefficient corresponds to a position on a straight line). If coefficients 
in the true model are closely related to their neighbors, we can exploit this by placing an 
additional penalty on the differences of neighboring coefficients. Several different choices for 
these penalties on neighboring coefficients are possible and in the case of the Fused Lasso, 
an Li penalty is being used. The resulting loss function is then 

^(y - X/3) T (y - X/3) + \ 1 £ + A 2 £ |A - f3 i+1 \. 

i=i i=i 

The second penalty with parameter A 2 shrinks neighboring coefficients towards each other. 
Just as the L\ penalty on the absolute values for the Lasso encourages sparseness, the 
penalty on \ — tends to set neighboring penalties exactly equal to each other. As such, 
the method is especially suitable for coefficients that are constant for an interval and change 
in jumps. 

In this article, we first want to concentrate on the most widely used case for this method, 
the Fused Lasso Signal Approximator (FLSA). In the FLSA, we assume that we have X = I 
as the predictor matrix. One example for this would be comparative genomic hybridization 
(CGH) or chromosomal microarray analysis (CMA) data. CGH is a method that identifies 
DNA copy number gains and losses on chromosomes by making two color fluorescence in situ 
hybridization at various points of the chromosomes. In this technique, normal and tumor 
DNA are labeled with fluorescent dyes (e.g. red and green) and using a microarray analysis, 
regions of increased or decreased fluorescence of one color compared to the other can be 
identified, indicating gains or losses of DNA at this place of the chromosome. As usual with 
this type of data, it is very noisy. Therefore, we seek to exploit that gains or losses typically 
appear for whole regions in the genome and that these changes usually occur in jumps. We 
can do this by penalizing differences of neighboring coefficients and therefore decrease the 
noise in the data and improve estimation. In this case, we use the one-dimensional Fused 
Lasso Signal Approximator (FLSA), for which the loss function is 

1 n n n—1 

Hy, p) = g Z>« - a) 2 + Ai iai + A 2 E ia - ft+ii- 

i=l i=l i=l 

Every coefficient $ is an estimate of the measurement yi taken at position i (which we 
assume to be ordered along the chromosome). Apart from the Lasso penalty Ai J2i=i I A I? the 
additional penalty placed on the difference between neighboring coefficients is A 2 YH=i I A — 
An example of CGH measurements in lung cancer can be seen in Figure 1. The red 
line are the estimates for penalty parameters Ai = and A 2 = 2. We can see that starting 
around measurement 150, the CGH results are on average below 0, indicating a loss of DNA 
in this region. 

Another example where the Fused Lasso model can be used is in image reconstruction. 
As a toy example, look at Figure 2. On the left hand side, we can see the true image and a 
noisy version in the middle. On the right hand side is the denoised version using the Fused 
Lasso. As the coefficients are not located on a straight line but instead on a 2-D grid, we 
have to use a different version of the penalty that penalizes all differences of neighboring 
coefficients in 2 dimensions. 

In its more general form, we assume that each coefficient corresponds to a node in a graph 
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Figure 1: Example using the one-dimensional Fused Lasso Signal Approximator on lung 
cancer CGH data. 
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Figure 2: Example of image reconstruction using the Fused Lasso. On the left hand side 
is the true image. The noisy version is in the middle. The reconstructed version using the 
Fused Lasso is on the right. 



Q = {V,E). Then we penalize every difference of coefficients if the corresponding nodes have 
an edge between them. Specifically, the loss function becomes in this case 

1 n n 

s s (s,i)£E;s<t 

which we will refer to as the general Fused Lasso Signal Approximator (FLSA). In the example 
above, the graph is a 2-D grid. 

These examples are only special cases of a more general Fused Lasso model. In this more 
general form, the Fused Lasso loss function is 

1 P 
L(y, X,/3) = -(y - X/3) T (y - X/3) + A x £ + A 2 £ |ft - fy\ 

i=l (i,j)£E,i<j 
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where as above E are the edges in the graph Q = (V, E) with V — {1, . . . ,p} representing the 
variables. Due to space constraints, the algorithm for the general Fused Lasso for matrices 
X G R n ' p where rank(X) = p and general graphs Q will be given in the Online Supplement 
in Section 4. For other matrices with rank(X) < p we do not develop a path algorithm as 
the solution path w.r.t. A2 can have discontinuities in this case. In the following sections, 
we will present path algorithms for the Fused Lasso Signal Approximator in its special one- 
dimensional and its general form. In Section 2 we will use the special structure of the 
one-dimensional Fused Lasso Signal Approximator to derive a fast algorithm that calculates 
the entire solution path with a complexity of nlogn. The general FLSA will be treated in 
Section 3 and is more complicated than the one-dimensional case due to the general structure 
of the penalty graph Q. We prove that our algorithm yields the exact solution and present 
simulation studies that compare the new algorithms to existing methods. Finally, in Section 
4 we will discuss the results and give possible extensions of these algorithms. 

2 One-dimensional Fused Lasso Signal Approximator 

We already used the one-dimensional FLSA in the CGH data example above. Now, in order 
to develop a path algorithm, we will first have another look at the loss function we seek to 
minimize: 

n n n—1 

L (y> P) = 2 Z>* - + Ai E l#l + A2 E l# ~ M- 

i=l i=l i=l 

Due to the simple structure of the loss function, it is possible in this case to obtain the 
solution for any value of (Ai,A 2 ) by simple soft-thresholding of the solution obtained for 
(0, A 2 ). To be more precise, the following theorem holds: 

Theorem 1. Assume we have X = I and that the solution for A x = and A 2 > is known 
and denote it by (3(0, A 2 ). Then the solution for Ai > is 

A(Ai,A 2 ) = ^n(A(0,A 2 ))(|A(0,A 2 )|-A 1 ) + for i = l,...,p. 

The proof of this theorem is presented in Friedman et al. [2007]. It should be noted that 
it also holds for the FLSA for arbitrary graphs Q, so that it can also be used for the more 
general FLSA algorithm. For the rest of this section, we assume that Ai = 0. The algorithm 
presented is a path algorithm that finds the solution for all possible values of A 2 . Any solution 
for a Ai 7^ can then be obtained by simply soft-thresholding as shown above. 

The path algorithm will start by setting A 2 = and then increase it until all coefficients 
Pi have the same value. For increasing A 2 , neighboring coefficients are forced to be equal to 
each other. Once this happens, these coefficients are being fused and subsequently treated 
as a single variable for increasing A 2 . In order to be able to do this, it is important to make 
sure that these coefficients cannot become unfused again for increasing A 2 . This is in fact the 
case and will be shown below. Before getting into the details of the algorithm, it is necessary 
to define some notation. 

2.1 Algorithm 

In order to develop the algorithm for the one-dimensional case, we first have to define what 
exactly the sets of fused coefficients are. 
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Definition 1. Let Fi, i — 1, . . . ,n F (\ 2 ) be the sets of coefficients at A 2 that are considered 
to be fused where n F (\ 2 ) is the number of such sets. In order for these sets to be valid, every 
set Fi has to be of the form Fi = {k\li < k < Ui} and the following statements have to hold 
as well: 



• Assuming the Fi are ordered, for every k,l G Fi we have f3 k (\ 2 ) = f3i(\ 2 ) and for 
k G F h l G F i+ i it holds that (3 k (\ 2 ) ^ A(A 2 ). 

For notational convenience, write f3 Fi (\ 2 ) for any f3 k (\ 2 ) with k G Fi and also suppress 
the dependency of Fi onto A 2 . Using this definition of fused sets, let us now turn to the 
algorithm. 

For the one-dimensional FLSA, a special result holds that makes the algorithm especially 
simple and also has been presented in Friedman et al. [2007]. Loosely speaking, it states that 
if coefficients are fused at A°, then these coefficients will also be fused for any A 2 > A°. To 
be more precise 

Theorem 2. Let f3 k (\ 2 ) be the optimal solution to the one- dimensional FLSA problem for 
coefficient k and penalty parameter A 2 . Then if for some k and \ 2 it holds that P k (\ 2 ) = 
/3fe+i(A 2 ), then for any A 2 > A 2 it holds that f3 k (\ 2 ) = fl k+ i(\ 2 ). 

A proof of this theorem is provided in Friedman et al. [2007] and an alternative proof 
using a different technique is given in the Online Supplement in Section 2.3. 

Using this theorem, the algorithm is very simple. First, we need a starting point for the 
path algorithm and as the optimal solution is known for A 2 = 0, which is just (3 k (0) = y k 
for all k, we use it to begin the path. Then the algorithm calculates step by step when two 
neighboring sets have equal coefficients and merges them. In order to calculate the slope 
of the coefficient paths for some value of A 2 , we assume that we know f3 k (\ 2 ) for all k as 
well as the sets of fused variables Fi. Using this information, we can calculate the derivative 
of /?fc(A 2 ) with respect to A 2 and we will see that these are actually constant, so that the 
resulting solution path is a piecewise linear function where the breakpoints occur when two 
set of coefficients are being fused. In order to find d/3 k (X 2 ) / d\ 2 , define the loss function L F ^ 2 
that incorporates the fused sets F,. This is done by taking the loss function L in Equation 
(1) and replacing f3 k by (3 Fi for k G Fi. This constrained loss function is then 



Due to the assumptions on the sets F, the constrained loss function L F ^ 2 is always differen- 
tiable with respect to f3 Fi unless two coefficients are just being fused (which only happens if 
&Fi — Pf 1+1 for some i). Assuming that f3 Fi is optimal, the derivative of L F) \ 2 has to be and 
we get 



. U£™F i = {l,...,n} 
• Fi n Fj = % + j 




9L Fj \ 2 
~Wf~ 



jeFi 

+ A 2 sign(f3 Fi - /3.F i+ i) = for i = 1, . . . ,n F (X 2 ) 



Fi\P Fi ~ Vi + A2 si § n (^ - &i_J+ 
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Algorithm 1: One-dimensional FLSA path algorithm 



initialize 

A 2 = 0; 

(3k = Vk for k = 1, . .. ,n; 
Fj = {«} for i = 1, . . . , n; 
uf = n; 
end 

while n F > 1 do 

Calculate next hitting time h(X 2 ); 

Let (io(A 2 ),i (A 2 ) + 1) = argmin ?li i+1 ( A2 ) >A2 h i>i+1 (X 2 ) be the indices of the sets to 
fuse next; 

Fuse the two sets F io ^x 2 ) and F io ^ 2 )+i', 
Set A 2 := h(X 2 ); 

Update the values for f3 k {X 2 ) , 9/3 qI^ and set n F — n F — 1; 
end 



where we set sign^^ — (3 Fo ) = and siga(p FnF(x ^ — (Aa)+1 ) = 0. Now, taking derivatives 
with respect to A 2 in these equations gives the results 

= - jpT| (sign(/3 Fi - /Vi) + sign(/? F . - /? F . +1 )) for i = 1, . . . , n F (A 2 ) 

As we already mentioned above, these are constant as long as the sets of fused coefficients do 
not change. By Theorem 2, we know that the only way to change the sets of fused coefficients 
is to merge two sets as they can never split for increasing A 2 . Therefore, the solution path is 
piecewise linear and for increasing A 2 the breakpoint occurs when two groups fuse. Thus, it 
is easy to calculate the next breakpoint, which occurs when neighboring sets have the same 
coefficients. In order to do this, define 



, , Af,(A 2 ) - (3 F 
hi,i + i{\ 2 ) = g^— ^ + A 2 for i = l,..., n F {\ 2 ) - 1 

c9A 2 c9A 2 



which is the value for A 2 at which the coefficients of the sets Fj and F i+1 have the same 
value and can be fused, assuming that no other coefficients become fused before that. If 
^i,i+i(A 2 ) < A 2 , these values are being ignored as the two groups Fj and F i+i are actually 
moving apart for increasing A 2 . The next value at which coefficients are fused is therefore 
the hitting time 

h(X 2 ) = min h i)i+1 (X 2 ). 

As we are taking the minimum, it is only defined if there is at least one h i<i+1 > A 2 . From 
equation (1) with Ai = we can easily see that for A 2 — > oo the solution is (3 k = - J2^ =1 yi 
for all k, thus only one group exists for large A 2 . Therefore, if n F (X 2 ) > 2, then there exists 
an hij+i > A 2 and therefore h(X 2 ) is defined. Based on these results, we can now write out 
the details of the algorithm that provides the entire solution path and they can be found in 
Algorithm 1. 

As we will show below, this algorithm only requires a low number of computational steps 
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Example of simulated dataset 
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Figure 3: An example of a simulated dataset with size n = 100 for the one-dimensional 
FLSA. 

and is of complexity nlog(n). Of course, apart from the computational complexity, it is also 
important to be able to save the results in an efficient manner. This can be done with memory 
usage on the order 0{n). A more detailed analysis can be found in the Online Supplement 
in Section 2. 

2.2 Speed comparison 

In order to evaluate the speed of our new algorithm, we want to compare it to other methods 
that have been published before. The first alternative we also use is the component-wise 
algorithm presented in Friedman et al. [2007]. The second is based on the general convex 
solver CVX, a package for specifying and solving convex problems (see Grant and Boyd 
[2008a, b]). CVX is very easy to use and flexible, which is why we chose it, despite the 
disadvantage that it cannot be used with a warm start. 

As datasets of a wide range of sizes are needed, the speed comparisons will be performed 
on simulated data. The simulated dataset consists of datapoints with values of 0, 1 and 
2. Roughly 20% of datapoints will have value 1 and 20% value 2. An example plot of a 
simulated dataset of size n = 100 can be seen in Figure 3. 

When calculating the solution, our new path algorithm and the competing methods take 
somewhat different approaches. Our algorithm calculates the whole solution path whereas 
the competing method calculate the solution only for a prespecified list of A 2 values. In order 
to make the two approaches comparable we measure the time each algorithm takes to find 
the solutions for 50 values of A 2 which are equally spaced between and 1. The results of 
the comparison can be found in Table 1. 

As it can be seen, the path algorithm is consistently faster than the component-wise 
optimization algorithm for all but the largest problems. They are also both much faster 
than the general convex solver CVX. In addition to this, the path algorithm also returns an 
object that stores the complete solution path in a compact form and can be used to extract 
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n 


1U 


1U 


1U 


1 n5 
1U 


1 n6 
ill 


1U 


cvx 


17.1 


30.2 


210 


3600 


>5 hours 


>5 hours 


Component-wise Alg. 


0.071 


0.081 


0.24 


1.1 


10 


98 


Path Alg. 


0.0006 


0.003 


0.030 


0.52 


7.8 


108 



Table 1: Time in seconds for a 1-dimensional FLSA problem of size n. All three algorithms 
calculate the solution for 50 equally spaced values of A 2 from to 1. Results averaged over 
10 simulations. 

solutions for additional values of A2 very quickly. 

After deriving the path algorithm for the one-dimensional Fused Lasso Signal Approx- 
imator, we want to generalize the algorithm to the case of the general Fused Lasso Signal 
Approximator. The most important difference to the previous algorithm is that a set of fused 
coefficients can also break into several sets for increasing values of A 2 . We will get into more 
detail in the next section. 

3 General Fused Lasso Signal Approximator 

In the introduction we have already seen an example where a more general penalty structure 
than in the one-dimensional FLSA can be very useful for reconstructing a noisy image. 
However, we do not need to restrict our attention to a two-dimensional grid. In this section, 
we will present an algorithm that finds the solution for the FLSA problem with an arbitrary 
graph Q = (V, E) (with set of vertices V — {1, . . . , n} and edges E) specifying the structure 
of the penalty parameter on differences. The loss function in this case is 

n p 

L(y,f3) = -J2(y*-fr) 2 + ^J2W + X i E ( 2 ) 

i=i i=i (i,j)eE,i<j 

so that we penalize \0i — f3j\ for every edge G E. The condition i < j only makes sure 
that we penalize \f3i — f3j\ only once as the edges in the graph are assumed to be undirected. 
As in the previous section, we can use the soft-thresholding theorem (see Theorem 1) and it 
is therefore possible to set Ai = and find a solution path for A 2 and later obtain any solution 
for Ai > by soft-thresholding. The algorithm for this more general case is conceptually 
similar to the one presented for the one-dimensional case. However, unlike in this simpler 
setup, for the more general penalty structure it is not guaranteed that a group of coefficients 
that is fused for value A 2 = A° will remain fused for A 2 > A°, but instead fused groups 
may break up for increasing A 2 . The main adjustment to deal with this problem will be to 
introduce a method to determine for which value of A 2 a group of fused variables will break 
up. 

In the following subsections, we will first make some necessary adjustments to the defi- 
nition of sets of fused variables and given these sets, we calculate d(3 Fi /d\ 2 . Next, we give 
the conditions under which a set of fused coefficients breaks up into two smaller sets and 
present a method on how to calculate critical values of A 2 for which this could happen. After 
incorporating this into the final algorithm, we present an approximate version of our method 
that is faster on large dataset by sacrificing some precision. Finally, we use our new algo- 
rithm on simulated data and compare its speed and accuracy to the other methods we have 
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already used for the one-dimensional FLSA. Now, let us first make the necessary changes to 
the definition of fused coefficients to account for the general graph structure Q. 



3.1 Sets of fused variables 

In the case of the one-dimensional FLSA before, we already specified certain conditions for 
the sets of fused variables Fj in Definition 1. Here, due to the more complicated structure of 
the graph Q, we have to restate the condition that any set has to be an interval. For general 
graphs, the condition is instead that any set of fused variables has to be connected. For the 
definition and also in the following sections, assume that we know the minimizer of the loss 
function for penalty parameter A 2 and denote it by p Fi (\ 2 ). Then the definition of a valid 
set of fused variables is: 

Definition 2. Let ni?(A 2 ) be the number of sets of fused variables for penalty parameter A°. 
Then for the sets Fj, i — 1, . . . , iif^X®) to be valid, the following conditions have to hold: 



3. If k,l G Fi then /^(A^) = A (^2) an< ^ ^ G FiJ e Fj,i 7^ j and Fj and Fj have a 
connecting edge, then flki^z) 7^ $1(^2) for all penalty parameters A 2 G (A°, A 2 + e) for 
some e > 0. 

4- If k,l G Fj then k and I are connected in Q by only going over nodes in F i} i.e. k,l are 
connected in Q\F i} the subgraph of Q induced by Fj. 

Compared to the previous version of this definition, the third and fourth condition have 
been adapted. The third condition now reflects that sets can be split up for increasing A 2 
and the last one is equivalent to the requirement that Fj is an interval in the one-dimensional 
FLSA case. 

Now, assuming that the sets of variables Fj that are fused are given, we will determine 
the slope of the optimal f3p i with respect to A 2 . In order to do this, we will incorporate the 
sets of fused coefficients into the loss function (2). Setting f3 k = f3 Fi for all k G Fj, the loss 
function becomes 



Here, note that by definition of the sets Fj, we have that /3f;(A 2 ) 7^ Af,-(A 2 ) for j ^ i 
(except for a finite number of A 2 for which sets are fused or split) and therefore the loss 
function is differentiable with respect to f3 Fi at the solution f3 Fi (\ 2 ). Thus, at f3 Fi (\ 2 ), the 



1. uzrF = {i,---,n} 

2. FiDFj = j 




+ \ 2 J2\{(k,l)eE:keF i ,le F 3 }\ \p Fi - p F . 



(3) 



i<j 
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derivative of L Ft \ 2 (y, (3) with respect to (5 Fi is 0, that is 



dL FM {y,f3) 



F i \p Fi -Y / yj+ 



jeFi 



+ \ 2 ^2\{(k,l) EE:keFi,le Fj}\sign 



for the solutions (3 Fi (\ 2 ). By taking the derivative w.r.t. A 2 and noting that for small changes 
of A 2 , the sign of f3 Fi — (3 Fj does not change, it is possible to determine df3 Fi (\ 2 )/d\ 2 as 



which is constant as long as the Fj do not change. Therefore, the solution (3 Fi (\ 2 ) is a 
piecewise linear function again. At the breakpoints of the solution path, the sets of fused 
variables change. As we will see in more detail later, there are 2 things that can happen: 

• PfX^) = ftFji^) for some i ^ j with Fj and Fj connected, which violates condition 3 
of Definition 2. In this case, fuse sets F, and Fj. 

• A set Fi has to be broken up into 2 smaller subsets. A way to determine when this has 
to happen is presented below. 

Once the sets have been updated, the solution path is again linear so that the whole solution 
path for A 2 can be obtained by updating the sets of fused coefficients at the right values of A 2 . 
Now we will take a closer look at the conditions under which we split a set of fused variables. 

3.2 Splitting and fusing sets of variables 

In order to see when it is necessary to split sets of variables, we have to see when the solution 
(3 Fi (\ 2 ) and its derivatives obtained from the constrained loss function L F ^ 2 from Equation 
(3) is also optimal for the unconstrained loss L\ 2 in Equation (2). For this, we will look at 
the subgradient equations of L\ 2 . An overview of subgradients can be found in Bertsekas 
[1999]. 

3.2,1 Subgradient equations 

As Lx 2 is not differentiable everywhere, it is convenient to use subgradients instead of the 
usual derivatives. For the subgradients, a necessary and sufficient condition for f3 k to be 
optimal is that 



where t kl = sign(/3 fe - A) for fa ^ A, hi G [-1, 1] for (3 k = ft. For the case (3 k = fa it is also 
enforced that t k \ = —t\ k (which is trivially true in the f3 k ^ fa case). Given a grouping F, 



d\ 2 



E^m,l)eE:k 



G F, / G Fj} \ sign {/3 Fi - /3 Fj ) 



(4) 




for k = 1, . . . , n 



(5) 
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these equations can be written slightly differently. With k G Fj, let 



dL X2 (y,P) 

d(3 k 



Pk - Vk + A 2 t M + A 2 ^2 tki = for k — l,...,n 

i+3 (k,l)eE:leFj (k,l)eE:leFi 



where we grouped the t k i for which k,l G Fj. Assuming that PfX^i) is a minimizer of the 
unconstrained loss function, there exist t k i(X 2 ) such that the subgradient equations hold. 
Writing r k i{\ 2 ) = A 2 t^(A 2 ) and taking the derivative w.r.t. A 2 in the subgradient equations, 
we get 



where we exploit the fact that t k i — ±1 is constant for k G Fj and / G F,-. Note that there are 
not necessarily unique values for dr k i/d\ 2 such that these equations hold and there may be 
an infinite number of possible solutions, any of which will serve our purpose. Also note that 
from above we know that as long as the set F stays fixed, d/3 k /d\ 2 is constant for k G Fj. 
This is the case because t k i for k G Fj and I G Fj with i ^ j is defined as t k i = sign(/3fc — Pi) 
and stays constant as long as the order of p k and Pi does not change. However, if the order 
changes, then by Definition 2 the set Fj has to change as well. Therefore, the equation above 
stays the same as long as F is fixed. Therefore, the dr k i/d\ 2 only depend on the groups and 
not on A 2 . However, apart from Equation (6), it also has to hold that r fe ;(A 2 ) G [— A 2 , A 2 ], as 
the condition t k i G [—1,1] has to hold. Thus, when we keep the groups F fixed, it will not 
always be possible to find dr k i/d\ 2 such that for increasing A 2 these conditions still hold. As 
dr k i/d\ 2 is constant, the r k i are piecewise linear and the t k \ are continuous in A 2 . Then there 
are two ways in which the subgradient conditions can fail: 

1. There is a group Fj for which for increasing A 2 , the condition — 1 < t k i < 1 for all 
k,l G Fj cannot be satisfied anymore. In this case, the group has to be split into two 
smaller subgroups. How to decide when this is the case and how to identify the two 
new subgroups will be treated below. 

2. There are two groups Fj and Fj for which there exists k G Fj and / G Fj with (k, I) G F, 
i.e. Fj and Fj have at least one edge connecting them. For these two groups, we have 
that t k i = ±1 does not hold anymore for all k G Fj and I G Fj. In this case, the groups 
Fj and Fj have to be fused. 

It is very easy to detect when the second case above occurs, i.e., when two groups have to be 
fused. If we have two groups Fj and Fj that are directly connected to each other by an edge, 
then they will be fused at A° if PfX^) = PFj{X 2 ) and /3^(A 2 ) ^ /9f,(A 2 ) for some interval 
A 2 G (A° — e, A°), i.e. groups Fj and Fj "hit" each other at A°. Detecting when a group has to 
be broken into two smaller groups is harder and we will discuss it in the following subsection. 

3.2.2 The maximum flow problem 

In order to decide when to split variables, it is necessary to find solutions for t k i(\ 2 ) (or 
equivalently r k i) in Equation (6). For k and / that are not in the same group, this will be 
easy as then t ki = sign(/3 Fi — p Fj ) for k G Fj and / G Fj. For k and / in the same group, i.e. 
k, I G F, we will see that r k i is an affine function of A 2 and the slope can be calculated by 




for k — 1, . . . , n. 



(6) 
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solving a maximum flow problem in graph theory. The same calculations can also be used 
to identify when a group has to be split by (more on this later). As maximum flow problems 
are a well studied area, fast algorithms for this problem such as the push-relabel algorithm, 
among others, exist and can be used here (see Cormen et al. [2001]). Before going into more 
details, assume that for penalty parameter A2, we know Tki(X 2 ) for k, I G F, i — 1, . . . , np(A 2 ) 
that solve the subgradient Equation (6). Now, for notational convenience, define 



= - 53 53 tH 



d(3 Fi {\2 



d\ 2 

for k G F and call it the push on node k as it measures the influence other variables in 
neighboring groups that are connected to k have on node k. Using these definitions, Equation 
(6) can be written as 

53 ^ =pk for k = !» • • • i n - ( 7 ) 

(k,l)eE:leFi 2 

For each of these equations, we can see that they only involve variables that belong the same 
groups, i.e. if k G F i} then all / used for the variables r k i are also in Fj and thus these n 
equations are separable according to the groups Fj. Therefore, for each of the groups F, we 
will solve a separate maximum flow problem to find <9rfcz/<9A 2 and determine if it is necessary 
to split the group. 

For any maximum flow problem, we need to specify the underlying graph which consists 
of vertices, edges and capacities on each edge in both directions. In order to do this, we first 
define the graph Qi = Q\f^ which is the graph Q restricted to the nodes in the set F. Now 
let Qi = (Vi,Ei,Ci) be the vertices, edges and capacities of the i-th problem. For these, we 
define: 

Vertices: To each of the subgraphs Q i} we add an artificial source node r and sink node s, 
such that Vi = ViU {r, s}. 

Edges: For the edges Fj we use all the edges in Fj and will add additional edges connecting 
each of the nodes in Vj to either the source or the sink. In order to motivate which 
nodes will be connected to which, note that at the end, we will set dr k i/d\ 2 = fki for 
the maximal flow fki from node k to node /. As the flow through every node (except 
for the source and sink) has to be 0, a node has an edge with the source if the RHS of 
equation (7) is greater than and an edge with the sink if it is less than 0. Thus 



Fj = Fj U {(r,l) : Pl >0}U {(k,s) : p k < 0} 



As usual, all the edges are undirected. 

Capacities: Of course, the capacities on the edges have to be defined as well and we will 
define them for each direction separately. As we will set dr k i/d\ 2 = fki for k, I G F i} the 
flows have to be constrained such that r M (A 2 ) stays within the interval [— A 2 , A 2 ]. This 
has to hold for all edges in Fj. For the edges to the source or the sink, the corresponding 
absolute value on the RHS of Equation (7) will be used. This way, with drki/d\ 2 = fki, 
Equation (7) will hold if and only if all edges coming from the source and all edges 
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going to the sink are at full capacity. 

In order to ensure that T kl (\ 2 ) G [— A 2 , A 2 ], we have the restrictions 



^_ I (-00,00) if t m G (-A 2 , A 2 ) 
G < (-00, 1] if t m = A 2 
[[-l,oo) ifr M = -A 2 . 

Correspondingly, we set for k,l G Fj 

{(-00,00) if r fcI G (-A 2 , A 2 ) 

(l,oo) if r H = A 2 

(00, 1) if r w = -A 2 . 

Now it only remains to define the capacities on the edges coming from the source or 
going to the sink. First, for the edges from the source r, i.e. all / for which (r, /) G E i} 

(c H ,Ci r ) = (p,,0) 

and correspondingly for the edges to the sink set for all k with (k, s) G E i} 

(Cks, c sk ) = (~Pk, 0) • 

Using all this set Q — |c w : k,l e V^j. 
Here it is interesting to note that 

which is easy to see by summing up Eequations (6), the definition of pk and that tki = ~Uk- 
Therefore, the sum of all capacities going out of the source is equal to the sum of all capacities 
going into the sink and thus a flow that is maximal for all source edges is also maximal for 
all sink edges. 

Now that we have defined the maximal flow problem, we will not go into any detail of 
how to solve it and just refer to the literature that we have already mentioned above. In 
the following, the solution to the flow problem will be referred to as fki, which is the flow 
from node k to node /, assuming that k, I G E^. Using this result we will now show that 
the solution path is piecewise-linear. The next theorem guarantees that for an interval, the 
solution for /3fe(A 2 ) and r^(A 2 ) are affine and have the slope as stated above. 

Theorem 3. For some A°, let Fi, . . . , F nF ^ be a valid grouping of the variables. Let Qi be 
the with Fi associated maximum-flow graph as defined above. Also let /3jt(A 2 ) and r^(A 2 ) be 
a solution to the FLSA problem for penalty parameter A 2 = A 2 . If Qi has a maximum flow 
for which all flows coming from the source are at maximum capacity (i.e. f r i = c r i for all 
(r,l) G Ei), and is as defined in equation (4), then there exists some A > such that 
for any A 2 G [A°, A° + A], the solution to the FLSA problem is given by 

A(A 2 )=/5 fc (A°) + ^(A°)-(A 2 -A°) for k G F t 
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and 



^_ _ | r fe /(A^) + /fej(A 2 - A!]) /or k,l e Fi for some i 

1 sign(rhi(X 2 ))\ 2 otherwise. 

The proof of this theorem can be found in the Online Supplement in Section 3. The only 
item in the previous proof that we haven't specified so far is the length of the interval A, 
for which the solution will be linear as described. There are two things that can occur, that 
would violate the assumptions of Theorem 3. First, two sets of variables Fi and Fj that 
have a connecting edge have Pf { = Pfj and therefore, the grouping is not valid anymore. 
In this case, the two sets have to be merged. Second, for some group Fj, the maximum 
flow problem is not at maximal capacity for all source nodes. Then, this group has to be 
split. However, before proving that these operations yield a valid grouping for which the 
assumption of Theorem 3 hold, we will determine A. 



3.2.3 The hitting time h and splitting time v 

In order to find A, we will first determine the smallest value A 2 > A° where two neighboring 
groups have the same coefficient. Next, we will determine the smallest A 2 > X 2 such that the 
conditions on r k i are violated. A° + A is then the smallest of these two values. 

Start by assuming that for penalty parameter A° we have a valid grouping F ± , . . . , F np ^ 
and solutions /3fc(A°) as well as t^A"). Given this, it is easy to calculate when two sets that 
are connected by an edge hit. For this, let the hitting time of groups i and j at A° be 



hij(\l) 



' (Pf, - (3 F] )/ (|£ _ + \o if 3k EF t ,le Fj with (k, I) EE 
oo otherwise. 



If hij < X 2 , then given the current slopes, these groups will not meet for A 2 > X 2 . If hy = X 2 , 
then /^(Aj) = /3f,(A 2 )- However, as we assumed that this is a valid grouping, from the 
definition we get that for some e > 0, for any A 2 G (A°, A^ + e) we have /^(A 2 ) ^ f3 Fj (\ 2 ) and 
as the trajectories are piecewise affine, the groups Fi and Fj move apart. Therefore, defining 

h(X° 2 ) = min hij(\° 2 ) 

we have that two groups will hit at h but not before. Therefore, the grouping remains valid 
for at least an interval h — A°. 

Now let us look at the maximum flow problem and how long the interval can be such 
that for all r H (A 2 ) G [— A 2 , A 2 ]. Then, given the flows f kh define the violation time of the 
constraint on tm as 



( |sign(/ fc ,)A°-nH(A°)| ., ,, 

vh(%) = \ ^ + 2 



> 1 



oo otherwise, 



and set 



v(\° 2 ) = min^A^). 

Then given the behavior of the r k i described in Theorem 3, we have that r^(A 2 ) G [— A 2 , A 2 ] 
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for A 2 G [A°,i>]. Furthermore, for A 2 > A 2 + v, at least one of the constraints would be 
violated. However, note that this does not necessarily mean that the group has to be split. 
It may just be necessary to find a new maximal flow that satisfies the constraints. Using the 
hitting and violation time, we can set 

A = mm(h(\° 2 ),v(\ 2 ))-\° 2 . 

We can now distinguish two cases: 

Case 1 h < v: Here, the two sets that hit at A 2 = h have to be merged. 

Case 2 h > v: In one of the sets, say F i} for at least one edge (k, I) with k, I G Fj we have 
r ki( v ) = if and if the slope remained unchanged, then |rfc/(A 2 )| > |A 2 | for A 2 > v, 
thus the constraint would be violated. Then the capacity constraints of the associated 
maximum flow problem has to be updated and a new maximum flow identified. If for 
the new flow, all source edges are at capacity, only the trajectories for the Tu have to 
be altered. Otherwise, the set has to be split. 

Therefore, we have defined the A of the previous theorem and identified the values of A 2 at 
which the piecewise-linear solution path has breakpoints. 

3.2,4 Adapting the sets of fused variables 

Of course, we still have to specify how to exactly split a set for which the source edges are 
not at capacity into two smaller subsets. Assume that Fj is the set that has to be split with 
associated maximum flow graph Qi. Then define the set 

Ri — |i G Fi : r connected to / by an augmenting path in C?j j 

where the augmenting path is defined with respect to the maximal flow fku i- e - f° r eac h node 
I G Ri there exists a path from the source r to / using only edges for which the flow is not at 
capacity. The complement of Ri with respect to Fj is defined as Si = Fj\Fj. Then we divide 
the set Fj into the two subsets Ri and Si. 

Now it remains to be shown that fusing or splitting sets as described above will yield sets 
of fused variables that satisfy the assumptions of Theorem 3. In particular, whenever we are 
at a breakpoint, i.e. have to fuse or split sets or both, we propose the following procedure 
for adapting the sets of fused variables: 

1. If there are sets Fj and Fj for which 3k G Fj and / G Fj with (k, I) G E and (3f 1 = Pfj, 
then fuse these sets into a new set Fj = Fj U Fj if (<9/%/<9A 2 ) — (<9 ( 5f 3 /9A 2 ) < and 
tki = 1- 

2. If there is a set Fj for which in the associated maximal flow graph not all edges coming 
from the source are at maximal capacity, then split F in the two subsets Ri and Si as 
described above. 

3. Iterate steps 2 and 3 until nothing changes. 

Using this procedure we can now show that adapting the sets in this way is correct. 
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Algorithm 2: General FLSA path algorithm 

initialize 

A 2 = 0; 

(3k = Vk for k = 1, . .. ,n; 

r fci = for fc, Z = 1, . . . , n; k ^ I; 

Fi = {i} for i — 1, . . . , n; 

riF = n; 
end 

while n F > 1 do 

Update f3p i and th; 

Calculate the derivatives of (5p i w.r.t. A 2 for i = 1, . . . , np\ 
Solve the maximum flow problem for for % — 1, . . . , rip] 
if not all flows from source are at capacity for graph Qi then 

Split set Fi into two smaller sets; 

rip := rip + 1; 
else 

Calculate next hitting time h(\ 2 ); 

Calculate the next violation time v (A 2 ) when a set has to be checked for 
violation of the constraints; 

if h(X 2 ) < v(X 2 ) then 

Fuse the two sets that hit each other; 

riF := np — 1; 
end 
end 

Set A 2 := min{/i(A 2 ),-i;(A 2 )}; 
end 



Proposition 1. Assume that we perform the fusion and split steps as described above. Then, 
the algorithm stops after a finite number of fusion and splitting steps and the resulting sets 
of variables are valid and satisfy the assumptions of Theorem 3. 

Again, the proof can be found in the Online Supplement in Section 3. Putting all this 
together, we have shown that the solutions are piecewise linear and how to change the sets 
of fused variables at the breakpoints. So overall, using this algorithm we can calculate the 
entire solution path. 

3.3 Outline of the algorithm 

In the previous sections, we have seen how to derive the entire solution path of the gen- 
eral Fused Lasso Signal Approximator. Overall, the algorithm is very similar to the one- 
dimensional FLSA outlined in Algorithm 1. The most important change is that, instead of 
only considering the fusion of sets, it is also necessary to track if it is necessary to break a 
set up into two smaller sets. Putting everything from the previous sections together, we get 
an outline of the algorithm for the general FLSA. 

It should be noted that this is a basic outline of the algorithm and there is room for 
considerable efficiency gains when implementing it. Most importantly, similar to the one- 
dimensional case from above, the hitting times hij only have to be updated if either the set 
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Fi or Fj have changed since the last calculation. The same is true for the maximum flow 
problems. The flows only have to be updated if the underlying set has changed or a violation 
of a constraint was triggered. Therefore, in every iteration only a small number of sets is 
involved in the calculation and the computations can be done quickly. Especially for larger 
sets, the computationally most expensive step is solving the maximum flow problem. This 
gives us the possibility to derive an approximate version of the algorithm that is much faster 
as we will see in the simulations section below. 

3.4 Approximate algorithm 

The algorithm described above gives an exact solution to the problem. However, for large 
sets of fused variables, the calculation of a maximal flow is a bottleneck. In addition to this, 
if large sets of fused variables split, the resulting sets tend to be very unequal in size, often 
only splitting off a couple of nodes on the edges. Therefore, a lot of time is spent on cases 
that do not influence the solution very much. In order to speed up the algorithm, we propose 
to not check sets of fused variables for splitting up once they are larger than a certain size K. 
Then, for any set of size K or larger, the maximum flow problem does not have to be solved, 
saving time on these especially computationally expensive sets. Also, as the values of tm are 
only used to determine when a set has to be split, these also do not have to be updated any 
more for the large sets. As we will see in the simulations section, the tradeoff in accuracy for 
moderate values of K is not very large but the algorithm speeds up considerably. 

3.5 Simulations 

In order to evaluate the performance of our exact and approximate algorithms above, we want 
to compare its speed and accuracy to the approximate FLSA algorithm for the 2-dimensional 
case presented in Friedman et al. [2007] as well as CVX (see Grant and Boyd [2008a,b]). 
As we also want to compare the accuracy of the approximate algorithms, we will use these 
techniques on simulated datasets, which we describe in more detail below. 

The comparisons between the algorithms will be performed on datasets of various sizes 
ranging from 10 x 10 to 200 x 200. On each of these datasets, the solution will be computed 
for 50 equally spaced values of A 2 between and 0.5. For this speed comparison, there are 
two things to note. 

First, as noted before, CVX cannot use the solution of a similar A 2 as a "warm start" to 
speed up computation. So, computing the solution for all 50 values of A 2 takes roughly 50 
times as long as computing the solution for just one value of A 2 . However, we chose to use it 
nonetheless as it is an easy to use general convex solver that can handle sparse matrices and 
is therefore equipped to handle large datasets. 

Second, the algorithm that is presented in this paper not only calculates the solution at 
the 50 values of A 2 , but at all breakpoints of the piecewise-linear solution. The whole path is 
saved in a compact format and can be used to extract the solution at other values of A 2 later 
much faster. For our dataset here, the solution path has at least as many breakpoints as 
there are datapoints, i.e. n 2 for annxn grid. However, we still only let the other algorithm 
evaluate it for 50 values of A 2 as we deemed it unrealistic that the solution for possibly 
thousands of A 2 values is needed. 
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Figure 4: A sample image of the simulated 2-dimensional dataset 

3.5.1 The dataset 

In our comparisons, we want to use the 2-dimensional FLSA. Therefore, our data will consist 
of data y = {yki} with k = 1, . . . ,ni and I = 1, . . . ,n 2 with corresponding coefficients (3^. 
The difference of coefficients will be penalized if they are neighbors on the 2-dimensional grid 
(horizontal or vertical), i.e. the loss function we want to minimize is 

ni n 2 n\— 1 n 2 nj n 2 — 1 

fe=l 1=1 k=l 1=1 k=l 1=1 

where we have already set Ai = 0. As shown above, we can get the solution for any Ai by 
soft-thresholding the solution for Ai = 0. In our simulated dataset, we set ri\ = n 2 = n for 
various values of n. The value of y^i is being generated as follows: 

1. Set yki = for all k, I = 1, . . . , n. 

2. For some rectangles of random size, change the value of y to either 1 or 2, such that 
roughly 20% have value 1 and 20% have value 2. 

3. To every point yu add standard normal noise with standard deviation of 0.2. 
A sample image of what the simulated dataset looks like can be seen in Figure 4. 

3.5.2 Results 

First, we compare the computation time of the three methods. The results can be seen in 
Table 2. The path algorithm as well as the component-wise algorithm are both much faster 
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Image size 


10 x 10 


50 x 50 


100 x 100 


200 x 200 


CVX 


36 


140 


1000 


6800 


Component-wise Alg. 


0.062 


0.61 


2.2 


7.7 




K = 1 


0.0031 


0.17 


1.1 


10 




K = 2 


0.0032 


0.18 


1.1 


10 




K = 5 


0.0040 


0.20 


1.3 


10 




K =10 


0.0046 


0.23 


1.4 


11 




K = 50 


0.0077 


0.37 


2.0 


14 


Path Alg. 


K = 100 


- 


0.51 


3.0 


18 




K = 500 


- 


1.3 


11 


62 




K = 1000 




1.4 


21 


120 




K = 2000 




1.4 


22 


290 




if = 5000 






23 


400 




exact 


0.0079 


1.4 


24 





Table 2: Speed comparison for the 2-dimensional FLSA in seconds. The solution is evaluated 
for 50 values of A 2 between and 0.5. The results are averaged over 10 runs for the 10 x 10 
and 50 x 50 dataset and 4 simulation runs for the rest. 

than the general convex solver CVX. When comparing the path algorithm to the component- 
wise algorithm, we see that they have roughly the same speed for low values of K , except for 
the small 10 x 10 dataset. 

With respect to accuracy, we measure both the sup-norm error as well as the root mean 
squared deviation (RMSD). These measures are being calculated for each value of A 2 and 
the largest values, averaged over several simulation runs are being displayed in Tables 3 
and 4. CVX returns the exact solution in all cases, although at the cost of a rather slow 
speed. The component-wise algorithm on the other hand is quite fast, however only yields 
an approximate solution, although with a rather small error rate in terms of RMSD. With 
varying K, the path algorithm is in between the other two methods. However, for values of 
K around 500 — 1000, the path algorithm is very accurate but still a lot faster than CVX. 
For most practical application, this time-accuracy tradeoff may be worthwhile. 

4 Conclusion 

In this article we develop a path algorithm for the Fused Lasso Signal Approximator in its one- 
dimensional and general form. We compared the speed and accuracy of the FLSA algorithm 
to other available methods and conclude that our method has advantages in terms of speed 
and the amount of information gathered and stored. Especially compared to standard convex 
solvers, our path algorithm is much faster for the FLSA. It is also very easy and quick to 
extract results for additional penalty parameter values. We also extend this work to the case 
of the general Fused Lasso where we restrict ourselves to predictor matrices with rank(X) = p 
where X £ R" xp (see the Online Supplement). 

Apart from the work presented here, there are several ways how we plan to expand on 
it in the future. It is possible to expand the Fused Lasso by allowing each summand in the 
penalty terms to have separate weights, i.e. a loss function of the form 
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Image size 


1U X 1U 


OU X ou 


inn y 1 nn 

1UU X 1UU 


ZUU X zuu 


CVX 














Component-wise Alg. 


0.30 


0.80 


0.73 


1.0 




K = 1 


0.31 


0.81 


0.73 


1.0 




K = 2 


0.26 


0.75 


0.73 


1.0 




K = 5 


0.22 


0.74 


0.73 


0.90 




K =10 


0.15 


0.68 


0.73 


0.90 




K = 50 





c\ A A 
0.44 


0.22 


0.52 


Path Alg. 


K = 100 


U 


U.OO 


n 1 q 
U.lo 


n a q 
U.4o 




K = 500 





0.025 


0.047 


0.14 




K = 1000 








0.013 


0.12 




K = 2000 








0.00017 


0.050 




K = 5000 








0.00017 


0.021 




exact 














Table 3: Absolute error accuracy comparison for the 2-dimensional FLSA. The accuracy 
of the approximate version of the path algorithm and the component-wise algorithm are 
compared to the exact solution using the supremum norm. The largest error of the 50 values 
of A 2 is reported. The results are averaged over 10 runs for the 10 x 10 and 50 x 50 dataset 
and 4 simulation runs for the rest. 



Image size 


10 x 10 


50 x 50 


100 x 100 


200 x 200 


CVX 














Component-wise Alg. 


0.056 


0.066 


0.041 


0.030 




K = 1 


0.059 


0.067 


0.045 


0.031 




K = 2 


0.056 


0.066 


0.044 


0.030 




K = 5 


0.041 


0.059 


0.042 


0.029 




K =10 


0.029 


0.053 


0.039 


0.027 




K = 50 


< 10~ 5 


0.035 


0.027 


0.020 


Path Alg. 


K = 100 





0.022 


0.022 


0.015 




K = 500 





0.0050 


0.0059 


0.0078 




K = 1000 





< 10~ 5 


0.00026 


0.0053 




K = 2000 





< 10~ 5 


< io- 5 


0.0028 




K = 5000 








< io- 5 


0.0019 




exact 














Table 4: Root mean square deviation (RMSD) accuracy comparison for the 2-dimensional 
FLSA. The accuracy of the approximate version of the path algorithm and the component- 
wise algorithm are compared to the exact solution using the RMSD. The largest error of the 
50 values of A 2 is reported. The results are averaged over 10 runs for the 10 x 10 and 50 x 50 
dataset and 4 simulation runs for the rest. 
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L XlM (y, X,/3) = ±(y- X(3) T (y - Xf3) + X, ]T w k \/3 k \ + A 2 £ w kl \(3 k - 

fc=l (k,l)eE,k<l 

This more complicated model can be solved by a generalization of the algorithms presented 
in this article. Also, the current algorithm for the FLSA is not optimized form graphs with 
large number of edges in them. We plan to develop a version that takes cliques into account 
to achieve further efficiency gains. 

We hope that our algorithms will be used to analyze data and as a building block for 
other new models. In order to facilitate this we will be publishing implementations of the 
algorithms in the form of an R package on CRAN and the authors website. 
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O 1 Introduction 

m 

In this online supplement to the article "A path algorithm for the Fused Lasso", we will 
give proofs and further results that would have gone beyond the possible size and scope of 
the paper. In Section 2 we will discuss the computational and memory complexity of the 
algorithm. In addition to this we will give an alternative proof to Theorem 2 of the article in 
which we show that for increasing A 2 , fused sets of variables do not become unfused again. In 
Section 3 we will give the proofs to Theorem 3 and Proposition 1 that complete our results 
for the general Fused Lasso Signal Approximator. Finally, in Section 4, we will extend the 
path algorithm of the FLSA to the case of the general Fused Lasso where for the matrix of 
predictors X e R nxp we have rank(X) = p and we also present the associated proofs. 

O 

2 One dimensional FLSA 

2.1 Computational complexity 

• • 

In Section 2 of the article, we have presented the algorithm that solves the one-dimensional 
^ Fused Lasso Signal Approximator. We stated that the computational complexity of the 

algorithm is nlog(n). In order to calculate the complexity of the algorithm, first note that 
the initialization step takes 2n + 2 operations. However, most of the computations are 
performed for calculating the next hitting time h(\ 2 ). When A 2 = 0, the derivative df3pjd\2 
has to be computed for all n groups and the smallest hitting time identified. This takes on the 
order 0{n) operations. However, in subsequent steps, h^+i remain the same except if either 
Fi or F i+ i was fused with its neighbor in the last iterations. This way, in each iteration only 
2 values of h^+i have to be updated. Finding the smallest value of ft^j+i requires operations 
on the order 0(log(n)) if an efficient data structure is used to save the h^ i+ i (e.g. a binary 
tree). The same is true for the updates of the f3 Fi . Only the f3 Fi of the groups that have 
just been fused have to be updated. For the other groups dfipJdXi stays the same, so we 
can interpolate PfA^z) when we need it based on A° when this group was created as well as 
/^(Ag) and its derivative. So this also only involves a constant number of operations per 
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Node 1 Node 2 Node 3 Node 4 




Figure 1: Example of a tree for storing the solution path of the one-dimensional FLSA. 

iteration. As with every iteration, the number of sets rip decreases by one, there are at most 
n — 1 iterations. Therefore, the entire solution path can be obtained with a computational 
complexity on the order 0{n log (n)). 

2.2 Storing the solution path efficiently 

One possibility to store and retrieve the optimal coefficients for a value of A 2 would be to store 
the whole coefficient vector (3 for every value of A 2 for which 2 sets are fused and interpolate 
linearly in between. However, then it would be necessary to store n 2 entries, which would 
be impossible for large values of n. However, when done efficiently, it is possible to store the 
entire solution with a memory requirement that only grows linearly in n. For this, the result 
is stored in the form of a binary tree. Every node contains a value for A 2 at which this node 
(corresponding to a set of fused variables) becomes active and the correct value of j3p i for 
this same value A 2 . 

We start by creating n unconnected nodes corresponding to the n coefficients for A 2 = 
and Pk = Vk- Then when two sets of coefficients fuse, add a new node that is the parent of 
the two nodes corresponding to the 2 groups that were just fused. In the new node, store the 
value of A 2 as well as the updated value of the coefficient f3 Fi . At the end of the algorithm, we 
will have one binary tree (for a small example see Figure 1). In order to retrieve one value of 
the solution at A 2 for coefficient /3k, first start at the leaf node of coefficient (3k- Climb up the 
tree until A 2 of the next parent node is larger than A 2 . The correct solution is then a linear 
interpolation between the (3f i values stored in the current node and the parent node. Using 
this method, the complexity for looking the whole vector of solutions f3 for a particular value 
of A 2 is then also 0(n log n). Therefore, running the path algorithm and getting a solution 
vector has a total complexity of 0(n log n). We will also be able to see this in the next section 
when we compare the speed of this algorithm to other existing methods for this problem. 

2.3 Alternative proof of Theorem 2 

An important prerequisite for the algorithm is the result of Theorem 2 in the article. A proof 
has already been given in Friedman et al. [2007] and here we want to give an alternative 
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version. It is based on the results of the Genreal Fused Lasso Signal Approximator, so read 
Section 3 of the article before this proof. 

Theorem 2 For the one-dimensional FLSA assume that for some A° > and Ai = 0, we 
have that (3k{\%) = (3k+i{\%) for some 1 < k < N — 1. Then, for any A 2 > A° we have that 
/3 fc (A 2 )=/3 fc+ i(A 2 ). 

Proof. First note that in Section 3.2.2, we described the maximum flow problem that we have 
to solve and under which conditions it is necessary to break up the set of fused variables. In 
the case of the one-dimensional FLSA, the maximum flow graph is especially simple. Assume 
that an arbitrary set of fused variables |F| is {k , k + l, . . . , k + \F\ — l}, which has the edges 
{(fco, k + 1), . . . , (ko + \F\ — 2, k + |F| — 1)}. For the capacities on these edges, it is sufficient 
to note that they are always > 1. Now it only remains to specify which edges are connected 
to the source node r and the sink node s and which capacities they have. Essentially, there 
are 4 possible situations: 

Case 1 Pk -i, Pk Q +\F\ > Pf- In this case we have that ^ = jjL Then we have that k and 
k + \F\ — 1 are connected to the source r and all other nodes are connected to the sink 
s. For the capacities from the source, we have c r \ — 1 — rprr < 1 and for the capacities 

to the sink it holds that Ck s = jj^ < 1. As the capacities on the inner nodes is at least 
1 as stated above, it is easy to see that for the maximum flow we always have f r \ = c T \ 
for / = k , k + \F\ — 1 and therefore the group will never be split. 

Case 2 (3k +i > Pf > f3k +\ F \ : H ere ? we have that ^ = and node k is connected to the 
source r whereas k + \F\ — 1 is connected to the sink s. All other nodes are not 
connected to either the source or the sink. For the capacities in this case we have 
Crk — c fc +|F|-i,s = 1 and therefore the maximum flow is always f r k = fk ,k +i = ■ ■ ■ = 
fk +\F\-2,k +\F\-i = fk +\F\-i,s = 1 and therefore, the group will never be split. 

Case 3 Pk +i < Pf < Pk +\F\- Similar to case 2. 

Case 4 (3 ko - 1: (3 ko+ \ F \ < (3 F : Similar to case 1. 

In these cases we assumed that k ^ 1 and k + \F\ — 1 ^ N, i.e. that the set of fused 
variables is not at the boundary of the graph. However, it is easy to see that in this case the 
set will also not break apart for increasing X 2 . Therefore, the proposition holds. □ 

3 The general Fused Lasso Signal Approximator 

In this section we want to give the proofs of Theorem 3 and Proposition 1 of the article. 
First, we will give the proof of Theorem 3, then the one of Proposition 1. 

3.1 Proof of Theorem 3 

Proof. In order to show that this is indeed the solution, we have to show that for any 
A 2 G [A^Ag + A], the subgradient Equations (5) hold. We will do this for every group Fj 
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separately. We know that 



dX 9 



tki- 



j& {k,l)eE;keFi;leFj 



For the maximum flow graph, we know that all nodes except the source and the sink have 
a net flow of 0. Furthermore, we also know that the sum of all capacities coming from the 
source is the same as the sum of all capacities going to the sink as 

^ c r i ~ £ c ks = ^2 Vk = 

(r,l)eEi (k,s)eEi k&Fi 

where we use that for any node that is not connected to either the source or the sink, p k = 0. 
From this we can see that if all flows coming from the source are at maximum capacity, so 
are all flows going to the sink. Therefore, for a maximum flow with all flows from the source 
being at maximum capacity, we have 



-fkr = c rk = p k if (r, k) G Ei 
£ fki= I -fks = -Cks = Pk if (k, s) G Ei 
:{k,i)£E;ieFi I = otherwise 



so that overall we get 



Pk(X 2 ) ~ Vk + A 2 £ 

l:(k,l)eE 

= v* + A 2 £ t kl (X° 2 ) + ^(X 2 - X° 2 ) 

l:(k,l)eE 2 

+ /^(A 2 -A°)+ t kl (X° 2 )(X 2 - X° 2 ) 



ieFi-.(k,i)eE 



i^Fi-.(k,i)eE 



= (A 2 - A°) 



d^F 
dX, 



+ /«+ E 



ieFi-.(k,i)eE 



i^Fi-.(k,i)eE 



(A 2 - A») 



£ f> 

l£Fi-.(k,l)eE 



H — Pk 



where we use the definition oi p k (X 2 ) = — J2igFi-(k i)eE ^(A 2 ) — ^f- in the second to last 
equality. As this is true for every k G Ei and group Fj, the solution as proposed in the 
theorem solves the subgradient equations and therefore minimizes the loss function. □ 



3.2 Proof of Proposition 1 

Proof. First, we want to show that after the fusion and splitting steps finish, we have a 
valid set of fused variables. For this, we have to show that for sets and Fj it holds that 
/3f;(A 2 ) 7^ Pf j (X 2 ) for X 2 G (X 2 ,X 2 + e) for some e > 0. However, this is certainly true. 
If PFi(X 2 ) 7^ f3 Fj (X 2 ), then this holds also for some interval A 2 G (A°, X 2 + e) as f3 k (X 2 ) is 
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continuous in A 2 . If we have that /^(A^) = (3 Fj {\%), then we can assume that -^(A^) > 

^^-(A 2 ) and tki — 1 for k E Fi,l E Fj;(k,l) E E, as otherwise Fi and Fj would have 
been fused. However, this means that /^(A^ > /3f.,(A 2 ) for A 2 E (A°, A° + e) and that 
tki{M) = sign(/3 Fi (A 2 ) -/3f j -(A 2 )), so the grouping is valid. 

Second, for all of the sets at the end of the splitting steps, the maximal flow condition 
of Theorem 3 trivially holds as any set for which it doesn't hold, would be split up into two 
smaller sets (and it always holds for sets of size 1). 

Therefore, it only remains to show that the fusion and splitting steps converge after a 
finite number of iterations. For this, we show that if a set Fi was split into subgroups Ri 
and Si at penalty parameter A 2 , then Ri and Si will not be merged at A 2 in a subsequent 
iteration. From this, we will conclude that there is only a finite number of possible iterations, 
as there is only a finite number of possible sets and we cannot have infinite cycles of fusions 
and splits for the sets. Therefore, the algorithm converges after a finite number of steps. 

So, as Fi was split into Ri and Si, we know that Ri ^ as well as Si ^ 0. For set Ri 
consider the capacities of edges the source into Ri minus the capacities of edges going from 
Ri either to Si or directly to the sink. As we split group F, we know that the capacities 
going into Ri are larger than the capacities going out and therefore, 

^2 c rl > ^2 °ki + ^2 ° ks - 

(r,l)eEi-leRi (k,l)&E t ;k£Ri,l£Si (k,s)^E i ;keR l 

For all edges (k, I) E Ei with k E Ri and I E Si we also know that 1 = Cki = fki = tki- Here 
fki = Cki follows as otherwise / would be connected to the source in the residual graph and 
therefore I E R i} which is not the case as I E Si. Also, it follows that c k i — 1, as c kt = oo is 
the only other option which cannot hold as the maximum flow is finite. Furthermore t^i = 1 
as Cki — 1. Then 

ch - ^2 c ks - ^2 tki> °- 

(r,l)eEi;leRi {k,s)&Ei;keRi (k,l)eE i ;keR t ,leS l 

Using that c r \ = pi for (r, I) E E; L and c ks = —pk for (k, s) E E^ we get 

J2 tki>0 - w 

k ^Ri (k,i)eE t -,keRi,ieSi 

Let p^ 4 be the push on node k in the graph associated with group Ri and p^ the push on 
node / associated with Si after the split. From equation (8), we know that 



Erf 



Rl - 0. 



We can also infer from the definition of the push on Fi and Ri that 

£«? + i*if£ = E»- E <*+i«.itf 

keRi fcefli keRi,ieS i: (k,i)eE 
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and therefore using equation (1) and YlktRiPk* = ® that 



d(3 Ri d(5 Fi 



dX 2 d\ 
Now, again using equation (8), we get 



90 Fi _ 


Ri\ 


d(3 Ri 


dX 2 


Ri\ 


+ 


\Si\ 


dX 2 



and thus 



+ 



d(l Ri > OPf 1> d(3 Si 





\Si\ 




\Ri\ + \Si\ 


dX 2 



dX 2 dX 2 dX 2 

As also t k i = 1 for k G Ri and I G Si, we can see that the groups that have just been split at 
A° cannot be merged again at the same penalty parameter A°. 

In turn, this also means that any two sets that have just been fused, cannot be immediately 
split up again into the same original sets at A 2 as this would lead us back to the starting 
position, which would force us to fuse the two sets again. However, we have just shown that 
this cannot happen and therefore we also see that any two sets that are being fused, are not 
being split up again immediately. □ 



4 Path algorithm for the General Fused Lasso 

In this section we want to expand the result of the article from a path algorithm of the general 
FLSA to a path algorithm for the general Fused Lasso with a predictor matrix X G R nxp 
where rank(X) = p. The loss function we want to minimize in this case is 

L Al;A2 (y,X,/3) = l -{y - X/3) T (y - X/3) + X 1 £ \/3 k \ + X 2 £ \/3 k - A| 

k=l (k,l)eE,k<l 

where X G R" xp and Q = (V, E) a graph that defines the penalty structure on the difference 
of coefficients (3 k — f3 t for (k, I) G E, the set of undirected edges. One important difference to 
the X = I case is that here, we cannot get the solution for all values of Ai by soft-thresholding 
the solutions for Ai = as we did for the FLSA. Therefore, we cannot calculate one path 
from which it is easy to get all solutions for any combinations of (Ai,A2). Instead, we will 
restrict ourselves to calculate the whole solution path for A 2 for a fixed value Ai. Therefore, 
we have to incorporate the additional penalty Ai Ylk=i \@k\ m to the algorithm. This will be 
done in a fashion similar to the LARS algorithm (see Efron et al. [2004]) by having active 
and inactive sets of fused variables. 

The reason that we restrict the development of the algorithm to the case of rank(X) = p 
is that if rank(X) < p, then the resulting solution path w.r.t. A 2 may have discontinuities. 
Here is a very small example where this happens. Assume that n — l,p — 2 with yi = 3 as 
well as xii = 1 and x± 2 = 0.5. For Ai = 1, the solution path for X 2 can be seen in Figure 2. 
As we can see, for A 2 = 1/3, the solution path has a jump. 

In this section, we will first enhance the definition of sets of fused variables to incorporate 
that these sets can now also be active or inactive. Next, we will show that the solution path 
is again piecewise linear and how to calculate its slope and the necessary changes at the 
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Solution path for discontinuity example 




Figure 2: Example of a discontinuous solution path if rank(X) < p. 

breakpoints. We will end this section by giving the complete algorithm that calculates the 
solution path for A2 where Ai is assumed to be fixed. 



4.1 Sets of fused variables and active sets 

In the case of the general Fused Lasso, it is not possible to assume that Ai = and therefore 
we have to account for the additional penalty Ai Y^,k=i lAfcl m the loss function. Therefore, in 
addition to sets of fused variables as in the case of the FLSA algorithm, it is also necessary to 
keep track if these sets Fj are currently "active", that is have an associated coefficient f3 Fi 7^ 0, 
or inactive when f3 F . =0. In order to account for this, we have to adapt the definition of 
sets of variables. However, before we do this we should have a look at the fc-th subgradient 
equation for the general loss function first: 



dL 



Ai,A2 



d(5 k 



-(X T y) k + (X T Xf3) k + \ lSk + A 2 

h(k,i)eE 



tki 



(2) 



where s k = sign(/3 fc ) for (3 k 7^ and s k G [— 1, 1] otherwise. Also, as before t kt = sign(/3 fc — (3i) 
for (3 k 7^ Pi and t k \ £ [—1,1] for (3 k = fy. Variables are considered fused if (3 k = (3i and unfused 
if (3 k 7^ Pi. They are active if p k 7^ 0, from which we see that in a set of fused variables, 
always all variables are either active or not. Thus, when "activating" coefficients, we always 
have to activate a whole set at once. In order to do this, we also have to keep track of sets 
of variables that are inactive by using s k as a substitute for p k . More precisely: 

Definition 3 Let Pf(X 2 ) be the number of sets of fused variables for penalty parameter A 2 
(keeping Ai fixed). Then for the sets F i} i = 1, . . . ,pf(X 2 ) to be valid, the following conditions 
have to hold: 

1. UZ^ 2) F t = {!,..., n} 



2. Ft n Fj = for i ^ j 



3. lfk,le Fi then /3 fc (A 2 ) = A(A 2 ) as well as s fc (A 2 ) = sj(A 2 ). If fc G Fjj G F,-,i 7^ j and Fj 
and Fj have a connecting edge, then fa{X 2 ) 7^ A(A 2 ) or Sfc(A 2 ) 7^ s/(A 2 ) for all penalty 
parameters in an interval (A 2 , A 2 + e) for some e > 0. 

4. If fc, / G Fj then k and / are connected in Q by only going over nodes in F, i.e. k, I are 
connected in the subgraph of £ induced by Fj. 

This definition is almost the same as Definition 2 except for a small change in point 3. 
This change reflects the adjustment for inactive coefficients as mentioned above. In addition 
to this, we also need a more formal definition of what active and inactive sets are. 

Definition 4 For penalty parameter X 2 (fixing Ai), let A^) be the set of active fused sets 
and A/"(A 2 ) be the set of inactive fused sets. Then 

A(X° 2 ) = {i\MX) ? OforA G (X° 2 ,X° 2 + e)} 

and 

Af(Xl) = {l,...,p F (X 2 )}\A(X° 2 ). 

Here, a set is defined as active at X 2 not if /3f;(A 2 ) 7^ but instead if /3f;(A 2 ) 7^ for 
A 2 > X 2 . There is only a difference between these two definitions at the breakpoints of the 
piecewise linear path. As we increase A from to 00, the definition for active sets was chosen 
to be "forward looking". This distinction is of a technical nature that is mostly important for 
the algorithm and proofs later. 

Now that we have defined fused and active sets, we will determine how fa as well as s k 
and tki change with A 2 . 

4.2 Derivation of the algorithm 
4.2.1 The loss function 

The derivation of the algorithm will work similar to the sections before. First, we will 
incorporate the conditions of fused and active sets into the loss function and then find the 
derivative of /^(A 2 ) with respect to A 2 . After this, we will determine how Sk and tu change 
with A 2 and use it to specify when sets of variables are becoming active or inactive and have 
to be fused or split. Afterwards, we will prove that the solution is piecewise linear in A 2 and 
assemble everything into a path algorithm. 

Before going into more details, we need to define some additional notation. Using the 
sets of fused variables and active sets, we specify a predictor matrix that incorporates this 
information. Using the sets of fused variables Fj for i — 1, . . . ,pf(X 2 ) define X F G R™ x p^( A2 ) 
with 

x f = Xfe 

and based on this we incorporate the active sets into X. F ' A G E nx '- 4 ^ A2 ^ by dropping all 
columns in X F that do not correspond to active sets. Similarly, by X- 4 we refer to the 
columns of X corresponding to coefficients in active sets, i.e. k G Fj with i G A(X 2 ), and 
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by X-^ to the columns of coefficients in inactive sets. Then the constrained loss function 
incorporating active and fused sets is 

L WuX2 (y,X,(3)= 1 -(y-X F (3 F ) T (y-X F (3 F ) + X 1 Yl \ F M\+ (3) 
+ X 2 J2 \{(k, I): he F h l G Fj}\ \p? - (3 F \ (4) 

i<j 

where f3 F is a vector such that (3[ = (3 Fi . Assuming that the fused and active sets are 
correct, the minimizer of Lp,AMM * s a ^ so the minimizer of the original loss function L\ ly \ 2 . 
Note that by definition of the fused sets Fi Pfj (except for a finite number of breakpoints). 
Therefore, the subgradient equations for this constrained loss function with respect to A 2 are 

dL wx (y,X,f3) = _ x F,T y + xF ,T xF(3F + XilFils ^ + 

+ A 2 \{(k,l):keF i leF j }\sign(p Fi -0 Fj )=O, 

where s Fi is a short form for Sk with k G Fi just as f3 Fi . Now, if we define vectors a and b 
with 

Oj = \Fi\s Fi and h = ^ \{(k, I) : k e F t l G Fj}\ sign ((3 F . - f3 Fj ) 

then we can write this optimality condition more compact as 

— w = -X y + X X (3 +Aia + A 2 b = 0. 

dp 

For % G *4.(A 2 ), we know that sf (A 2 ) = s_p i (A 2 ) = 1 locally w.r.t. A 2 and for % G jV(A 2 ) 
we have f3 F (A 2 ) = (3 F (X 2 ) = 0. Therefore, when taking derivatives with respect to A 2 we get 
(splitting into active and inactive sets) 

(X F ' A ) T X F ' A ^— + b A = 
oX 2 

as well as 

<9A 2 d\ 2 

We can solve this by 

^ = -((X^) r X^)"V (5) 



<9A 2 
and 

^ = -l((X-fX-^ + b-) (6) 

from which we get ds Fi /d\ 2 = (l/|Fj|) • (da^ /dX 2 ). Here, by the condition that rank(X) = p, 
it is guaranteed that ((X F " A ) T X F " A ) 1 exists. Therefore, we have calculated the derivative of 
(3 F and s F from Equation (3). From this, we can derive the slope of the path for f3 F as well as 
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determine when sets of coefficients become active or inactive. However, in order to see when 
to split a set, it remains to determine the behavior of t k i with respect to A 2 or equivalently of 
I'm = X 2 t H . We do this by inserting the solutions found above into the subgradient equations 
of the unconstrained problem (2). As before, we assume that we have a solution at A 2 that 
satisfies the subgradient equations. Then for any coefficient k E Fi we have 

-x^y + *lX A (3 A + X lSk + A 2 + ^ = ' 

l:{k,l)eE;k,leFi l:(k,l)eE;l£Fi 

again grouping the t k i by whether k and / belong to the same or different sets. When taking 
the derivative with respect to A 2 , we get 

l:{k,l)eE;k,leFi l:(k,l)eE;l£Fi 

In order to determine dr k i/dX 2 , we use the maximum flow setup from the general FLSA 
algorithm as well as that %i = sign(/9 fc — f3{) for k E F i: l E Fi. Then we define the push p k 
on node k as 

T „ A d(3 A ds k ^ 

p* = -** x -dx 2 - x 'dx 2 - 2. *« 

l:(k,l)eE;lgFi 

and therefore we have to solve 

P k= E ^ for fe = i,---,p 

l:(k,l)eE;k,leFi 

This is exactly the same problem as in Section 3.2.2 and we use the same maximum flow 
setup described there to find dr k i/dX 2 by setting dT k i/dX 2 = f k i where f k i is the maximal flow 
from node k to node / in F^ Using all this, we can now show that the solution is piecewise 
linear in A 2 in the following theorem: 

Theorem 4 For some A 2 , let F 1 , . . . , F pp ^ be a valid grouping of the variables. Let 
Qi = (Vi, Ei, Ci) be the with Fj associated maximum-flow graph as defined above. Also 
let (3k(X 2 )i Sfc(A 2 ) and r H (A 2 ) be a solution to the Fused Lasso problem for penalty parameter 
A 2 = A 2 . Assume that Qi has a maximum flow for which all flows coming from the source 
are at maximum capacity (i.e. f r \ = c r i for all (r,l) E Ei), and df3pJdX 2 is as defined in 
Equation (5) as well as ds k /dX 2 as in (6). Then there exists some A > such that for any 
A 2 E [A 2 , A 2 + A], the solution to the Fused Lasso problem is given by 

ft(A 2 ) = Pk(X° 2 ) + ^(X 2 )-(X 2 -X° 2 ) for kEFi 

ds F /x0 , 



s k (X 2 ) = s k (X° 2 ) + ^(X° 2 ) ■ (X 2 - X° 2 ) for kEFi 



and 



Tkl(X 2 ) 



Tki(X 2 ) + fki(X 2 - \%) for k,l E Fi for some i 
sign(r M (A 2 ))A 2 otherwise. 
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The proof is very similar to the one for Theorem 3 and again can be found in Section 4.4. 
As before, we still have to define what A is. 



4,2,2 Activation and deactivation time 

In order to define A, we have to find the value of A 2 at which the fused sets themselves or 
their activation status changes. The hitting time h(X 2 ) and violation time v (A2) are the same 
as in Section 3.2.4. In addition to this we also have to define the value of A 2 at which sets 
are activated or deactivated. The activation time act(X 2 ) is 

act(A 2 ) = min actj(A 2 ) 

ieAf(X 2 ) 

where 



( 1 - s ^ 


+ A 2 


if fr>0 

<9A 2 


ds p. 


dx 2 






8 F .+1 
I da F i 


+ A 2 












otherwise 



acti(X 2 ) = 



is the activation time of set F, with % G A/"(A 2 ). The deactivation time on the other hand 
d(X 2 ) is 



where 



di(X 7 



d(X 2 ) = min di(X 2 ) 



'^ + A 2 if^<0,/3 Fi >0 
h +A 2 if||>0,^<0 



[00 



otherwise 



is the deactivation time of the active set Fj. Putting all this together, the length of the linear 
segment to the next breakpoint is 

A(A 2 ) = min {h(X 2 ),v(X 2 ), act(X 2 ), d(X 2 )} — X 2 . 

Therefore, we have now defined all the necessary information on the previous theorem. 

4.2.3 Changing sets of variables 

What remains to be done is to find the rule on how to split and fuse sets of variables as 
well as how to activate or inactivate a set. First, the rule for splitting and fusing sets is very 
similar to the one in Section 3.2.4 with the only change being in step one to take the Sk into 
account so that inactive sets can also be fused. 

1. If there are sets Fj and Fj for which 3k G F and I G Fj with (fc, I) G E and Pp i = j3p i 
and s Fi = s Fj , then fuse these sets into a new set F^- = F; U Fj if — < 0, 
- ^ < and t kl = 1. 



2. If there is a set F, for which in the associated maximal flow graph not all edges coming 
from the source are at maximal capacity, then split F in the two subsets Ri and Si as 
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described above. 



3. Iterate steps 2 and 3 until nothing changes. 
Second, the rule for activating and inactivating the sets is then: 

1. If for an inactive set F we have > if = 1 or < if = — 1, then activate 
set F t . 

2. If for an active set F, we have (5 Fi = and s Fi = 1, < or s Fi = —1, > 0, 
then deactivate the set. 



When we use these two rules for values of A 2 at which we have to adapt the fused and active 
sets, then the resulting sets will be valid. More precisely, the following proposition will hold: 

Proposition 2 Assume that sets of variables F i} i = 1, . . . ,p F (\ 2 —) are given (where A 2 — 
denotes a limit from the left) with optimal values of (3 Fi {\ 2 ), s Fi (X 2 ) and t k i(\ 2 ) for all k,l. 
Furthermore assume that d(3 Fi /d\ 2 and ds F Jd\ 2 are calculated as defined in equations (5) 
and (6) using the sets F;. When using the two rules above for changing the fused sets and 
the activity status, the resulting sets will be valid according to Definition 4.1 and 4.1. 

The proof can be found in Section 4.5. This proposition completes the necessary steps 
for the path algorithm and we can now present an outline of the whole algorithm. 



4.3 Outline of the algorithm 

The algorithm is an extension of the FLSA algorithm that incorporates the active and inactive 
sets. In the case of the general FLSA algorithm, finding the starting value for A 2 = was 
particularly easy. For general Fused Lasso, finding the starting value needs an additional 
step. For A 2 = 0, the loss function is 

1 P 
L Al ,o(y, X, (3) = -(y - X/3) T ( y - X/3) + X 1 £ \(3 k \ 

k=i 

which is just the loss function of the regular Lasso. Therefore, we can find the starting values 
for j3 by solving a regular Lasso problem. In order to find the starting value for s, we have 
to look at the subgradient equations: 

= -(X T y) fe + (X T X/3) fc + X lSk = 

Opt 

and thus 

S = ^(X T y-X T X/3). 

Using these starting values we now have the complete algorithm which can be seen in 
Algorithm 1. 

Most of the components of this algorithm are very similar to the FLSA algorithm. How- 
ever, many efficiency gains that we could get in the FLSA case are not possible for the general 
Fused Lasso. For the FLSA, the derivative of f3 Fi and t k i for k,l e Fj only had to be updated 
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Algorithm 1: General Fused Lasso path algorithm 

initialize 

A 2 = 0; 

Ft = {i} for % = 1, .. . ,p; 

pf = p; 

Find f3k and Sk for k — 1, . . . , n by solving the Lasso problem; 
end 

while p F > 1 do 

Update f3, s and t ; 

Calculate the derivatives of fa and w.r.t. A 2 for i = 1, . . . 
Solve maximum flow problem for Fi, i — 1, . . . ,p^(A 2 ); 
Calculate the next hitting time h(X 2 ); 
Calculate the next violation time v (A 2 ); 

Calculate the next time a set will be activated act(X 2 ) or deactivated (i(A 2 ); 
Set A(A 2 ) = min{/i(A 2 ),t;(A 2 ),a(A 2 ),d(A 2 )} - A 2 ; 
if ^(A 2 ) = A(A 2 ) + A 2 then 
Fuse the two sets Fi and Fj\ 

Pf ■= Pf - 1; 
A 2 := ^ij(A 2 ); 
else if i>j(A 2 ) = A(A 2 ) + A 2 then 
if Vi(X 2 ) = A 2 then 

Split the set Fi into two smaller sets; 

Pf ■= Pf + 1 ; 
end 

A 2 := Vi(X 2 ) ; 
else if a«(A 2 ) = A(A 2 ) + A 2 then 

Activate the set Ff, 

X 2 := ai(X 2 ) ; 
else if di(X 2 ) = A(A 2 ) + A 2 then 

Deactivate the set Fi ; 

A 2 := di(X 2 ) ; 

end 
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if Fi was the result of the last split or fusion. For the Fused Lasso on the other hand, if only 
one of the active groups changes its composition, the path for all other groups are affected as 
well and have to be recomputed, including the maximum flow problems. Only inactive groups 
can be treated as in the FLSA case. Another computational bottleneck is the calculation of 
((x^fx^p. It is worthwhile to note that it is not necessary to recompute the whole 
matrix in every step. As from one step to the next, only very few columns of X^- 4 change, 
it is possible to get the new matrix inverse by up- and downdating the old matrix inverse. 

Overall, the General Fused Lasso algorithm is computationally a lot more complex than 
the FLSA case. Apart from this, the more immediate applications are for the FLSA, especially 
the one-dimensional case. Therefore, we will only publish implementations for the FLSA cases 
at the moment and delay a program that solves the general version of the Fused Lasso to a 
later date. 

4.4 Proof of Theorem 4.2.1 

Proof. By the assumption, the values for f3 Ft (^i)i ^(^2) an d ^(^2) satisfy the subgradient 
equations (2) for the value A 2 = A°. Then the subgradient equations for value A 2 G [A°, A°+A] 
are 

- (X T y) fc + (X T X/3(A 2 )) fe + A lSfc (A 2 ) + A 2 ^ t kl (X 2 ) 

l:(k,l)€E 

= -x T k y + x T k Xf3(\ 2 ) + \ lSk (\ 2 ) + £ r kl (X° 2 ) + £ r kl (X° 2 ) 

l:(k,l)eE,lgFi l:(k,l)eE,leFi 

+ (A2-A°)(xrx|^ + A 1 ^+ £ tM)+ Yl /« 

\ 2 2 l:(k,l)eE,l<£Fi l:(k,l)eE,leFi 

= 

and we have to show that they are equal to zero. As by assumption /3(A°), s(A 2 ) and r(A 2 ) 
are by assumption optimal, it suffices to show that for all k 

l:(k,l)eE,lgFi l:(k,l)eE,leFi 

By the definition of the push on node k, this in turn is equivalent to 

-Pk + 5^ f kl = °' 

l:(k,l)eE,leFi 

In order to see this, observe that by assumption in the theorem, all nodes coming from 
the source are at maximum capacity. Furthermore, note that as t k i = —ti k , we also have that 

- \{(k, l):ke F it l G Fj}\ sign (fc - (3 Fj ) = 
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as can be seen in the derivation of the algorithm. Therefore, 



P k 



k&Fi, Pk >0 



k&F t , Pk <0 



and thus, if all flows coming from the source are at capacity, so are all flows going into the 
sink. As the flows going into and out of every regular node k in the maximum flow problem 
sum up to 0, this just implies that for every node k 



4.5 Proof of Proposition 4.2.3 

Proof. This proof is very similar to the proof of Proposition 2 above. 

Before we start with the main part, note first that /3(A 2 ) is a continuous function of A2. 
Under the condition that rank(X) = p, the loss function L\ lj \ 2 (y, X,/3) is strictly convex in (3 
and affine in A 2 . From this we can immediately see that ming L Al A2 (y, X, (3) is a continuous 
function of A2. However, then the strict convexity in (3 of L implies the continuity of the 
solution /3(A 2 ). 

Now back to the actual proof. In order for a set Fj to be valid at A 2 , it has to hold that 
for any other set Fj that /3f;(A 2 ) 7^ Pf^M) or s_f 4 (A 2 ) 7^ sf,(A 2 ) for A 2 G (A 2 , A 2 +e) for some 
e > 0. 

If (A°) 7^ /9fj(A 2 ) or sf;(A 2 ) 7^ Si?.(A 2 ), then this clearly holds as (3 and s are continuous 
in A 2 . 

If (3 Fi = Pfj and SFi = sf, , then the two sets will be fused unless one of the conditions 
d/3Fi/d\2 — d^FjdXi < 0, dsFi/d\ 2 — ds F Jd\2 < or t ki = 1 is violated. First, as F and 
Fj are different sets, we have that t k i = ±1 and as tki = —tik by definition, we can assume 
without loss of generality that t ki = 1. If either one of the other two conditions fails, then 
either /3p i (A 2 ) > /3f,(A 2 ) or SF t (A 2 ) > s^.(A 2 ) for A 2 G (A 2 , A° +e) and it is also not violating 
the restriction that tki = sign(/5fc — Pi) for (3 k ^ (3i where (k, I) G E with k G Fi and / G Fj. 
Therefore, the set is valid. 

Another concern that we have to address is that we have to ensure the existence of t k i 
that satisfy the constraints t k i G [—1, 1] for k,l G Fj for all i. However, if a flow exists in Q i} 
by the construction of the flow graph, this holds. 

Therefore, for the fusing and splitting steps, it remains to show that there cannot be an 
infinite loop of fusions and splits. As there is only a finite number of nodes, it is enough to 
show that if two sets Fj and Fj have just been merged, they cannot be split again into Fj 
and Fj immediately at A 2 = A 2 . 

In order to do this, assume that there is only one breakpoint exactly at A 2 = A 2 . Otherwise 
let e ~ A r (0, /„) and perturb y by using y + ce where c > and send c — > 0. The probability 
of two breakpoints occurring for the same A 2 for the random y + ce is and the solutions 
are all linear in c, thus the solution if just y is being used is equal to the limit of the solution 
using y + ce for c — > 0. 

Therefore, we can without loss of generality assume that there is only one breakpoint. 
Also, we assume that /3i^(A 2 ) = /3f,-(A 2 ) as well as /3 J F i (A 2 ) > /3f,-(A 2 ) for A 2 G (A 2 — e, A 2 ) and 



Pk + ^2 f' kl = 



l:(k,l)eE,leFi 



holds. And this finishes the proof of the theorem. 



□ 
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therefore d(3 F Jd\2 < dp Fj /d\ 2 (the same remains true when we use s instead of (3). This 
implies that the two groups F» and Fj have to remain fused for at least an interval [A°, A 2 ] 
for A° < A3,. To see this, observe that there are only two other possible cases: 

/3f 1 (A 2 ) > /^(A 2 ) for A 2 > Ajj: This would require that <9/3 Fi /<9A 2 (A 2 ) > d/3 F ./dA 2 (A 2 ) as 
^(^2) = /^(Aij)- However, this is not possible as in this case, equation (5) did not 
change compared to A 2 < A°. Therefore df3 F Jd\ 2 < df3 F Jd\ 2 as before, contradicting 
/^A 2 )>/? F .(A 2 )forA 2 >A°. 

/3f;(A 2 ) < (3 Fj {\ 2 ) for A 2 > A°: In this case we would have that due to the continuity of (3 it 
holds 

8L FM , (y,X,/3) = aj^M0) + ^ m k e n l e Fj}| 

By assumption we know that /3(A 2 ) is optimal for A 2 < A° and thus 

L F , AMM (y,X,(3) = for A 2 < A°. 

However this implies then that /3(A 2 ) for which f3 Fi (\ 2 ) < f3 Fj (\ 2 ) can not be optimal 
for A 2 > A 2 , again contradicting the assumption. 

This shows that a fused group F U Fj has to remain fused for an interval [A°, A 2 ]. However, 
for our algorithm it is necessary that then a flow exists for which all flows coming from the 
source are at maximum capacity. As the grouping with Fi U Fj is optimal for [A 2 , A 2 ], using 
the subgradient equations (7) as well as the linearity of (3 A and s in A 2 and that tui = const, 
for (k, I) e E, k e Fi U Fj and I £ Fj U Fj we have that the r M for (k, I) G E;k,l G F U Fj 
have to satisfy a linear equation of the form 

^ r kl + A 2 c fe = 4 for k G F U F,- 

{k,l)eE;leFiUFj 

for some vectors and 4- As the grouping is optimal, we also know that a solution exists 
for every A 2 G [A°, A2]. The space of all solutions of this linear equation is a vector space 
and thus we know that there exist linear functions tm(A 2 ) that satisfy these equations. But 
then, the flows fki = <9r^/(9A 2 satisfy the maximum flow problem of the graph Q Fi \j Fj with 
all flows from the source at maximum capacity (as the maximum flow problem solves the 
above equations after taking the derivative w.r.t. A 2 ). Therefore, the group F U Fj will not 
be forced to break up by our algorithm. 

Now the only thing that remains to do is to make sure that the rules for activating and 
deactivating sets are correct. In order to see this, we again use the continuity and piecewise 
linearity of (3 w.r.t. A 2 . It is easy to see, that this also implies continuity and piecewise 
linearity for s w.r.t. A 2 . 

Activate an inactive set: Let F be the inactive where w.l.o.g. SFi(A 2 ) = 1 as well as 

PfX^) = an d (9s Fi /dX 2 )(X2) > 0- First, the set cannot remain inactive for A 2 > A 2 , 
as then we would have (ds Fi /d\ 2 )(\2) < 0, which is a contradiction. Also (9/3 Fi /9A 2 )(A 2 ) < 
is not possible, as then s(A 2 ) = —1 for A 2 > A 2 , which would violate the continuity of 
s. Therefore, (<9 ( 5f 1 /^A 2 )(A 2 ) > and thus /3f;(A 2 ) > for A 2 > AS] and then F satisfies 
the activity condition. 
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Deactivate an active set: Let F be active where w.l.o.g. s^A") = 1, Af^A") = and 
(df3 F Jd\ 2 )(\ 2 ) < 0- Then due to the negative derivative, we have f3 Fi (^2) < for 
A 2 > A°. However, if (3 Fi (\ 2 ) < 0, then s Fi (\ 2 ) = — 1, again violating the continuity of 
s. Thus, we have that (3 Fi (\ 2 ) = for A 2 G (\ 2 , \ 2 +e) and therefore, the condition for 
being active is not satisfied for Fj. Thus it is inactive. 

□ 
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